SSPS4102 Data Analytics in the Social Sciences
SSPS6006 Data Analytics for Social Research
Semester 1, 2026
Last updated: 2026-01-23
I would like to acknowledge the Traditional Owners of Australia and recognise their continuing connection to land, water and culture. The University of Sydney is located on the land of the Gadigal people of the Eora Nation. I pay my respects to their Elders, past and present.
By the end of this lecture, you will be able to:
Key Readings
TSwD: Ch 12.1-12.2 | ROS: Ch 6-7
Linear models have been used for centuries to understand relationships in data.
Historical origins:
Modern uses:
At a fundamental level, regression has two purposes:
Key Insight
Regression is fundamentally a technology for prediction and comparison - not necessarily for identifying causal effects.
The simplest regression model is linear with a single predictor:
\[y = a + bx + \epsilon\]
Where:
How do we find the “best” line through the data?
Many lines could be drawn, but we want the one that fits best.
Criterion: Minimise the Residual Sum of Squares (RSS)
\[RSS = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}e_i^2\]
Where:
For simple linear regression, the least squares estimates are:
\[\hat{b} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]
\[\hat{a} = \bar{y} - \hat{b}\bar{x}\]
Good News!
You don’t need to calculate these by hand - R does it for you with lm()
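That said, the formulas are easy to check. Below is a minimal sketch, using made-up toy data, that computes the least squares slope and intercept by hand and confirms they match what `lm()` returns:

```r
# Verify the least-squares formulas against lm() on small
# made-up data (x and y here are purely illustrative)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

fit <- lm(y ~ x)
c(by_hand = b_hat, via_lm = unname(coef(fit)["x"]))  # the two slopes agree
```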
The lm() Function
In R, we use lm() (linear model) to fit regressions:
Key components:
outcome ~ predictor: The formula (outcome on left, predictor on right)
data = dataset: The data frame containing your variables
Let's simulate data about the relationship between 5km run time and marathon time:
library(tidyverse)  # provides tibble(), mutate(), select()

set.seed(853)
num_observations <- 200
expected_relationship <- 8.4 # Marathon is ~8.4x longer than 5km
sim_run_data <- tibble(
five_km_time = runif(n = num_observations, min = 15, max = 30),
noise = rnorm(n = num_observations, mean = 0, sd = 20),
marathon_time = five_km_time * expected_relationship + noise
) |>
mutate(
five_km_time = round(five_km_time, 1),
marathon_time = round(marathon_time, 1)
) |>
select(-noise)
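The regression output on the next slide can be produced with a call like the following (the object name `sim_run_fit` is illustrative; the data are re-simulated compactly here so the chunk runs on its own):

```r
# Re-create the simulated running data, then fit the linear model
set.seed(853)
sim_run_data <- data.frame(five_km_time = runif(200, min = 15, max = 30))
sim_run_data$marathon_time <-
  sim_run_data$five_km_time * 8.4 + rnorm(200, mean = 0, sd = 20)

# Fit marathon time as a linear function of 5km time
sim_run_fit <- lm(marathon_time ~ five_km_time, data = sim_run_data)
summary(sim_run_fit)
```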
Call:
lm(formula = marathon_time ~ five_km_time, data = sim_run_data)
Residuals:
Min 1Q Median 3Q Max
-49.289 -11.948 0.153 11.396 46.511
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.4692 6.7517 0.662 0.509
five_km_time 8.2049 0.3005 27.305 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.42 on 198 degrees of freedom
Multiple R-squared: 0.7902, Adjusted R-squared: 0.7891
F-statistic: 745.5 on 1 and 198 DF, p-value: < 2.2e-16
Coefficients:
Model fit:
We recovered the truth!
True slope was 8.4, estimate is 8.2 ± 0.3 (standard error)
(Intercept) five_km_time
4.469242 8.204932
five_km_time
8.204932
The regression equation is:
\[\text{Marathon time} = 4.47 + 8.20 \times \text{5km time}\]
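To see what the equation implies, we can plug in a hypothetical 5km time, say 20 minutes, using the rounded coefficients from the slide:

```r
# Predicted marathon time for a runner with a 20-minute 5km time,
# using the fitted coefficients (rounded as on the slide)
a_hat <- 4.47
b_hat <- 8.20
a_hat + b_hat * 20  # about 168 minutes, roughly 2.8 hours
```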
Critical Insight
Regression coefficients are commonly called “effects,” but this can be misleading. We should think of them as comparisons, not causal effects.
What the slope really means:
“Comparing runners whose 5km times differ by one minute, we find their marathon times differ, on average, by about 8.2 minutes.”
This is a between-person comparison, not a within-person effect!
Consider a regression predicting earnings from height and sex:
\[\text{earnings} = -26.0 + 0.6 \times \text{height} + 10.6 \times \text{male}\]
Tempting but wrong interpretations:
Better interpretations:
The height-earnings regression shows:
Possible explanations (all consistent with the data):
Bottom Line
Regression tells us about associations, not causes. Causal interpretation requires additional assumptions and designs (Week 12).
The residual standard deviation tells us about prediction accuracy:
Interpretation:
This comes from the normal distribution properties we learned in earlier weeks.
\[R^2 = 1 - \frac{\text{Variance of residuals}}{\text{Variance of outcome}}\]
[1] 0.7901541
[1] 0.7901541
Interpretation: 79% of the variation in marathon times is explained by 5km times.
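The two identical values above reflect two equivalent ways of getting R²: computing it by hand from the variance formula, or extracting it from the fitted model. A sketch with illustrative toy data:

```r
# R^2 by hand versus the value stored in the model summary;
# the toy data here are purely illustrative
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

r2_by_hand <- 1 - var(residuals(fit)) / var(y)
r2_from_lm <- summary(fit)$r.squared
c(r2_by_hand, r2_from_lm)  # identical
```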
What R² tells us:
Useful for:
What R² doesn’t tell us:
Be cautious:
What we want: Residuals centred around zero, roughly symmetric, no patterns
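A quick way to check this is to plot residuals against fitted values; a sketch with illustrative simulated data:

```r
# Residuals vs fitted values: a healthy fit shows a patternless
# cloud centred on zero (data here are illustrative)
set.seed(1)
x <- runif(100, 15, 30)
y <- 8.4 * x + rnorm(100, 0, 20)
fit <- lm(y ~ x)

plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)  # reference line at zero
```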
predict()
Once we have a fitted model, we can make predictions:
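Calls like the following produce the confidence and prediction intervals shown below (the data are re-simulated here so the chunk runs on its own; the new runner's 20-minute 5km time is illustrative):

```r
# Re-create the simulated data and fitted model for this chunk
set.seed(853)
sim_run_data <- data.frame(five_km_time = runif(200, min = 15, max = 30))
sim_run_data$marathon_time <-
  sim_run_data$five_km_time * 8.4 + rnorm(200, mean = 0, sd = 20)
sim_run_fit <- lm(marathon_time ~ five_km_time, data = sim_run_data)

new_runner <- data.frame(five_km_time = 20)

# Interval for the *average* marathon time at this 5km time
predict(sim_run_fit, newdata = new_runner, interval = "confidence")

# Interval for an *individual* runner's marathon time: much wider
predict(sim_run_fit, newdata = new_runner, interval = "prediction")
```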
fit lwr upr
1 168.5679 165.8405 171.2953
fit lwr upr
1 168.5679 134.1013 203.0345
Key difference:
The broom package provides three key functions for working with models:
| Function | Purpose | Returns |
|---|---|---|
| tidy() | Coefficient estimates | One row per term |
| glance() | Model-level statistics | One row per model |
| augment() | Observation-level stats | Original data + fitted values |
tidy(): Coefficient Table
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.47 6.75 0.662 5.09e- 1 -8.85 17.8
2 five_km_time 8.20 0.300 27.3 4.70e-69 7.61 8.80
This is much easier to work with than raw summary() output!
glance(): Model Summary
# A tibble: 1 × 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.790 0.789 17.4 746. 4.70e-69 1 -854. 1715. 1725.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Key columns:
r.squared: Proportion of variance explained
sigma: Residual standard deviation
statistic, p.value: F-test for overall model significance
augment(): Observation-Level Data
# A tibble: 6 × 8
marathon_time five_km_time .fitted .resid .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 164. 20.4 172. -8.05 0.00585 17.5 6.32e-4 -0.463
2 158 16.8 142. 15.7 0.0133 17.4 5.55e-3 0.906
3 196. 22.3 187. 8.16 0.00501 17.5 5.55e-4 0.470
4 160. 19.7 166. -5.81 0.00670 17.5 3.77e-4 -0.334
5 121. 15.6 132. -11.6 0.0175 17.4 4.00e-3 -0.670
6 178. 21.1 178. 0.607 0.00529 17.5 3.24e-6 0.0349
Key columns: .fitted (predicted values), .resid (residuals), .hat (leverage), .cooksd (influence)
Fake-Data Simulation
Simulating data where we know the truth helps us:
“The most valuable benefit of doing fake-data simulation is that it helps you build and then understand your statistical model.” — Gelman, Hill & Vehtari (2020)
# Step 1: Set true parameters
a <- 46.3 # True intercept
b <- 3.0 # True slope
sigma <- 3.9 # True residual SD
# Step 2: Generate fake data
x <- c(0.1, 3.2, 2.9, 3.8, 1.3, 4.0, 2.2, 1.0, 2.7, 0.7, 3.9, 2.6, 1.9, 1.5, 3.4, 2.0)
n <- length(x)
# Step 3: Simulate outcomes
set.seed(123)
y <- a + b * x + rnorm(n, 0, sigma)
Now we have data where we know the true relationship is \(y = 46.3 + 3.0x + \epsilon\)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 44.1 1.84 23.9 9.35e-13
2 growth 4.36 0.710 6.14 2.54e- 5
| Parameter | True Value | Estimate | Within 2 SE? |
|---|---|---|---|
| Intercept | 46.3 | 44.1 | ✓ |
| Slope | 3.0 | 4.4 | ✓ |
If we repeat this process many times, we expect:
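This expectation can be checked directly by rerunning the fake-data simulation many times, using the true parameters and x values from the slides, and counting how often the true slope lands within 2 standard errors of its estimate:

```r
# Repeat the fake-data simulation and check 2-SE coverage of the slope
a <- 46.3; b <- 3.0; sigma <- 3.9
x <- c(0.1, 3.2, 2.9, 3.8, 1.3, 4.0, 2.2, 1.0, 2.7, 0.7,
       3.9, 2.6, 1.9, 1.5, 3.4, 2.0)

set.seed(123)
covered <- replicate(1000, {
  y <- a + b * x + rnorm(length(x), 0, sigma)
  est <- summary(lm(y ~ x))$coefficients["x", ]
  abs(est["Estimate"] - b) < 2 * est["Std. Error"]
})
mean(covered)  # roughly 0.95 (slightly less with only 16 observations)
```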
Francis Galton noticed something curious about height:
Children of tall parents tend to be taller than average…
…but shorter than their parents
Children of short parents tend to be shorter than average…
…but taller than their parents
This is regression to the mean - where the term “regression” comes from!
Apparent paradox: If heights regress to the mean, won’t variation disappear?
Resolution:
Key Insight
Regression to the mean occurs whenever predictions are imperfect. It’s a mathematical fact, not a causal process.
Consider students taking midterm and final exams:
Wrong interpretation: High performers get lazy, low performers work harder
Correct interpretation: This is regression to the mean - both exams measure ability imperfectly, and extreme scores tend to be partly due to luck
Real Example
Flight instructors found pilots improved after criticism and got worse after praise. Actually, this was just regression to the mean - no causal effect of feedback!
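The exam story is easy to reproduce by simulation: give everyone a fixed ability, make each exam equal ability plus independent luck, and look at how the top midterm scorers do on the final. The numbers below are illustrative.

```r
# Regression to the mean: two noisy exams measuring the same ability
set.seed(42)
n <- 1000
ability <- rnorm(n, mean = 70, sd = 10)
midterm <- ability + rnorm(n, mean = 0, sd = 10)  # score = ability + luck
final   <- ability + rnorm(n, mean = 0, sd = 10)

top <- midterm > quantile(midterm, 0.9)  # top decile on the midterm
c(midterm_top = mean(midterm[top]), final_top = mean(final[top]))
# The top group's final average falls back toward the overall mean:
# their midterm luck does not repeat, with no change in effort
```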
Interpretation errors:
Technical issues:
Always visualise your data before and after fitting
Interpret cautiously - use comparison language, not causal language
Report uncertainty - coefficients without standard errors are incomplete
Check assumptions - examine residuals for patterns
Simulate - if unsure how something works, simulate it!
What we learned:
Key R functions:
lm() - fit models
summary(), coef() - examine results
predict() - make predictions
broom::tidy(), glance(), augment() - tidy output
residuals(), fitted() - diagnostics
Week 7: Multiple Regression
Preparation
Read TSwD Ch 12.3-12.4 and ROS Ch 9-10